Cross-genre Gender Identification in Russian Texts Using Topic Modeling Working Note: Team DUBL
Abstract
In this paper, we describe the gender identification results of Team DUBL. We used a topic modeling approach to identify an author's gender from their written texts. The model was trained on the RusProfiling PAN 2017 Twitter Corpus, which contains Russian-language data. The model was then evaluated on texts of other genres, including letters to a friend, online reviews, and Facebook posts, among others. Our model obtained competitive results and was shown to outperform more sophisticated algorithms on gender identification.
Related papers
The Winning Approach to Cross-Genre Gender Identification in Russian at RUSProfiling 2017
We present the CIC systems submitted to the 2017 PAN shared task on Cross-Genre Gender Identification in Russian texts (RUSProfiling). We submitted five systems. One of them was based on a statistical approach using only lexical features, and the other four on machine-learning techniques using some combinations of gender-specific Russian grammatical features, word and character n-grams, and suffix n...
Overview of the RUSProfiling PAN at FIRE Track on Cross-genre Gender Identification in Russian
Author profiling consists of predicting an author's traits (e.g. age, gender, personality) from their writing. After addressing mainly age and gender identification at PAN@CLEF, in this RusProfiling PAN@FIRE track we have addressed the problem of predicting an author's gender in Russian from a cross-genre perspective: given a training set on Twitter, the systems have been evaluated on five differe...
Gender Prediction for Authors of Russian Texts Using Regression And Classification Techniques
Automatic extraction of information about authors of texts (gender, age, psychological type, etc.) based on the analysis of linguistic parameters has gained particular significance, as there are more online texts whose authors either avoid providing any personal data or make it intentionally deceptive, despite it being of practical importance in marketing, forensics, and sociology. These studies...
Applying Multi-Dimensional Analysis to a Russian Webcorpus: Searching for Evidence of Genres
The paper presents an application of Multidimensional (MD) analysis, initially developed for the analysis of register variation in English (Biber, 1988), to the investigation of a genre-diverse corpus built from modern texts of the Russian Web. The analysis is based on the idea that each linguistic feature has different frequencies in different registers, and statistically stable co-oc...
Overview of the PAN/CLEF 2015 Evaluation Lab
This paper presents an overview of the PAN/CLEF evaluation lab. During the last decade, PAN has been established as the main forum of text mining research focusing on the identification of personal traits of authors left behind in texts unintentionally. PAN 2015 comprises three tasks: plagiarism detection, author identification and author profiling studying important variations of these problem...